Note: I have only done EDA to answer the questions asked. I have not done any EDA for the purpose of feature engineering or feature selection.
## checking_account_status duration_in_months
## 0 0
## credit_history purpose
## 0 0
## credit_amount savings_account_status
## 0 0
## present_employment_since installment_as_percent_of_income
## 0 0
## marital_sex_type role_in_other_credits
## 0 0
## present_resident_since assset_type
## 0 0
## age other_installment_plans
## 0 0
## housing_type count_existing_credits
## 0 0
## employment_type count_dependents
## 0 0
## has_telephone is_foreign_worker
## 0 0
## is_credit_worthy
## 0
So, no missing data. Yayyy!
Before exploring the relationships of the predictors with the target, let's first clearly define the target.
Credit worthiness for a group of observations can be measured by the Good/Total proportion: the higher the proportion, the higher the credit worthiness.
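As a quick sketch of this definition (in Python/pandas, though the analysis itself is in R; the toy data and column names here are assumptions for illustration), the per-group proportion is just a grouped mean of the Good indicator:

```python
import pandas as pd

# Toy data in the shape of the dataset: one row per applicant,
# a categorical predictor and the Good/Bad target label.
df = pd.DataFrame({
    "credit_history": ["A30", "A30", "A34", "A34", "A34"],
    "is_credit_worthy": ["Bad", "Good", "Good", "Good", "Bad"],
})

# Credit worthiness of a group = Good / Total proportion for that group.
worthiness = (df["is_credit_worthy"] == "Good").groupby(df["credit_history"]).mean()
print(worthiness)  # higher proportion = more credit worthy
</antml>```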
Question: Would a person with a critical credit history be more credit worthy?
Again, let's first define what 'critical' means. In the absence of any concrete definition, I will assume 'critical' roughly means more existing credits, i.e. criticality increases from A30 to A34.
'Critical' has a positive association with credit worthiness.
Q. Are young people more creditworthy?
The distributions overlap considerably. But there are more young people in “Bad” compared to “Good”, and that is also visible in the difference in means. So, young people seem slightly less credit worthy.
But let's break age into groups to see finer details.
“Bad” is quite low for the (34, 39] age group
Q. Would a person with more credit accounts be more credit worthy?
I am assuming more credit accounts is the same as “Number of existing credits at this bank”, i.e. 'count_existing_credits'.
The data is too unreliable to say anything about the relationship between the number of credit accounts and credit worthiness.
As mentioned earlier, I didn't do any EDA from a feature engineering perspective, so there is no feature engineering.
For feature selection I have used Boruta, which I have almost always found to be the best feature selection technique. Below is the Boruta plot:
Selected features are:
## [1] "checking_account_status" "duration_in_months"
## [3] "credit_history" "purpose"
## [5] "credit_amount" "savings_account_status"
## [7] "present_employment_since" "installment_as_percent_of_income"
## [9] "role_in_other_credits" "assset_type"
## [11] "age" "other_installment_plans"
## [13] "housing_type" "employment_type"
## [15] "is_credit_worthy"
It is worse to class a customer as ‘Good’ when they are ‘Bad’, than it is to class a customer as bad when they are good.
Let ‘Good’ be the positive class, and ‘Bad’ be the negative class. So the above statement will translate to:
False Positives (FPs) are more expensive than False Negatives (FNs)
Such cases fall under the **Cost-Sensitive Learning** strategy, and the following sub-strategies can be adopted under it:
I will try the following three models:

- Logistic Regression
- Boosted Trees: GBM
- Random Forest
I will go with a Custom evaluation metric:
I have assigned the following weights to the different buckets of the confusion matrix to penalize each bucket differently:
## Reference
## Prediction Good Bad
## Good -0.4 1
## Bad 0.2 0
There is no particular reason for these exact values; only their relative differences matter, because they penalize FPs more than FNs. Plus, I am rewarding TPs (True Positives).
Now, the custom metric is just the normalized sum-product of these weights and the confusion matrix of the model. Let’s call it “credit_cost”.
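A minimal sketch of this metric (in Python, though the analysis itself is in R; the matrix layout follows the weight table above, rows = prediction, columns = reference):

```python
import numpy as np

# Weights from the table above: rows = prediction (Good, Bad),
# columns = reference (Good, Bad). TPs are rewarded (-0.4),
# FPs cost the most (1), FNs cost less (0.2), TNs are neutral (0).
weights = np.array([[-0.4, 1.0],
                    [ 0.2, 0.0]])

def credit_cost(conf_matrix):
    """Normalized sum-product of the weights and a confusion
    matrix laid out the same way. Lower is better."""
    cm = np.asarray(conf_matrix, dtype=float)
    return float((weights * cm).sum() / cm.sum())
</antml>```

For example, for the first train confusion matrix reported below, `credit_cost([[518, 116], [43, 125]])` works out to (−0.4·518 + 1·116 + 0.2·43 + 0·125) / 802 ≈ −0.103.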
I have used an 80:20 train-test split. For validation, I will be using cross-validation wherever required.
I am taking the baseline as predicting everybody as 'Good'.
Train credit_cost
## Baseline Train Cost: 0.0206982543640898
## Baseline Train Precision: 0.699501246882793
Test credit_cost
## Baseline Test Cost: 0.0171717171717172
## Baseline Test Precision: 0.702020202020202
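These baseline numbers can be reproduced from the class counts, which are recoverable from the reported confusion matrices and baseline precisions (561 Good / 241 Bad in train; 139 Good / 59 Bad in test — inferred, not stated directly in the output). A sketch in Python:

```python
def baseline_cost(n_good, n_bad):
    # Predicting everybody as "Good": every actual Good is a TP
    # (weight -0.4) and every actual Bad is a FP (weight 1);
    # normalize by the total number of observations.
    return (-0.4 * n_good + 1.0 * n_bad) / (n_good + n_bad)

print(baseline_cost(561, 241))  # train cost: 0.02069825...
print(baseline_cost(139, 59))   # test cost:  0.01717171...
</antml>```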
Train Results:
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Bad
## Good 518 116
## Bad 43 125
##
## Accuracy : 0.802
## 95% CI : (0.772, 0.829)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.0000000000343
##
## Kappa : 0.484
##
## Mcnemar's Test P-Value : 0.0000000112995
##
## Sensitivity : 0.923
## Specificity : 0.519
## Pos Pred Value : 0.817
## Neg Pred Value : 0.744
## Prevalence : 0.700
## Detection Rate : 0.646
## Detection Prevalence : 0.791
## Balanced Accuracy : 0.721
##
## 'Positive' Class : Good
##
Train Results:
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Bad
## Good 543 16
## Bad 18 225
##
## Accuracy : 0.958
## 95% CI : (0.941, 0.97)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.899
##
## Mcnemar's Test P-Value : 0.864
##
## Sensitivity : 0.968
## Specificity : 0.934
## Pos Pred Value : 0.971
## Neg Pred Value : 0.926
## Prevalence : 0.700
## Detection Rate : 0.677
## Detection Prevalence : 0.697
## Balanced Accuracy : 0.951
##
## 'Positive' Class : Good
##
Train Results:
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Bad
## Good 535 52
## Bad 26 189
##
## Accuracy : 0.903
## 95% CI : (0.88, 0.922)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : < 0.0000000000000002
##
## Kappa : 0.761
##
## Mcnemar's Test P-Value : 0.00464
##
## Sensitivity : 0.954
## Specificity : 0.784
## Pos Pred Value : 0.911
## Neg Pred Value : 0.879
## Prevalence : 0.700
## Detection Rate : 0.667
## Detection Prevalence : 0.732
## Balanced Accuracy : 0.869
##
## 'Positive' Class : Good
##
credit_cost and Precision are in sync.
Train results are best for GBM, but it is overfitting, i.e. its variance is high, so the test results are not as good.
Test results are best for Random Forest. It has less variance than GBM, but its bias is higher.
It may seem like GBM is the better model, but we still haven't seen the uncertainty (variance) in the results. The difference between the train and test set results gives some idea about it, but it is better to look at cross-validated results.
Not much difference here either; DRF seems only slightly better, but that may change with the fold assignment. For GBM I tuned the positive-class upsampling but didn't tune other hyperparameters, and for DRF I did the exact opposite. So both models still have a lot of scope for tuning, and I am not yet at a stage where I can pick the right model.
We can look at the feature importance of either GBM or DRF, but DRF gives a cleaner plot without breaking categorical features into their individual classes, so we will use DRF.
The top-3 features are “checking_account_status”, “duration_in_months”, and “credit_amount”.
To profile a ‘Good’ credit worthy person as per the model, let’s explore the relationship of top predictors with the predicted class for the DRF model.
So, the most credit worthy person would have the following profile:
- checking_account_status is “A14”, i.e. no checking account
- duration_in_months is less than 12 months, i.e. a year
- credit_amount is less than 2k
- credit_history is “A34”, i.e. critical account/other existing credits
- purpose is “A43”, i.e. radio/television
This seems slightly unintuitive, but I would have to go into model explainability to get better insights, and time is currently short for that.